
mpi4py: enable spawn tests workflow by default #12591


Merged: 1 commit merged into open-mpi:main from wenduwan:enable_mpi4py_spawn on May 31, 2024

Conversation

wenduwan
Contributor

The previous inter-communicator race condition has been fixed. Enable the workflow by default to catch new regressions.
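For context, the workflow drives mpi4py's dynamic-process tests. A minimal sketch of the parent/child spawn pattern they exercise (illustrative only; the process counts and file layout here are assumptions, not the actual mpi4py test suite):

```python
import sys
from mpi4py import MPI

if MPI.Comm.Get_parent() == MPI.COMM_NULL:
    # Parent side: spawn two children running this same file.
    inter = MPI.COMM_SELF.Spawn(sys.executable, args=[__file__], maxprocs=2)
else:
    # Child side: reconnect to the parent.
    inter = MPI.Comm.Get_parent()

inter.Barrier()     # synchronize across the inter-communicator
inter.Disconnect()  # tear the connection down cleanly
```

The parent/child synchronization runs over the inter-communicator returned by Spawn/Get_parent, which is the code path the race condition affected.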

@hppritcha
Member

I vote for merging the tests into one GitHub Action.

@wenduwan wenduwan force-pushed the enable_mpi4py_spawn branch from c854585 to 442fc6f Compare May 30, 2024 20:53
@wenduwan
Contributor Author

@jsquyres @hppritcha Merged workflows as requested.

@wenduwan wenduwan requested review from jsquyres and hppritcha May 30, 2024 20:54
@wenduwan
Contributor Author

@hppritcha I merged #12526 and rebased this PR.

@wenduwan wenduwan requested a review from hppritcha May 31, 2024 15:20
@wenduwan
Contributor Author

@jsquyres Do you have more comments? Merge is blocked until you approve the change.

@jsquyres
Member

If we're not splitting into multiple runs on the same build anymore, is there any reason to keep two jobs (build and run_tests)? You could delete the entire upload/download of the build artifacts.

@wenduwan wenduwan force-pushed the enable_mpi4py_spawn branch from ad42bed to dc3f425 Compare May 31, 2024 18:38
@wenduwan
Contributor Author

@jsquyres Good point. I merged both ompi_mpi4py.yaml and ompi_mpi4py_tests.yaml into a single test job. Does it look better?

@jsquyres
Member

Much better, thanks.

Do we really need to test all of these cases? I ask because the whole thing takes nearly 30 minutes. Is there value in all of these, or should we cut some of them / make the overall run shorter?

[Screenshot, 2024-05-31 4:25:56 PM]

E.g., the final np=5 run with the relocated OMPI -- does that need to be np=5? Or do we even need it at all? I.e., if singleton works with the relocated OMPI, does np=5 test anything new?

@wenduwan
Contributor Author

Honestly I don't think it's that bad. Jenkins is currently the long pole at 50 minutes, so running mpi4py tests for 30 minutes sounds benign to me (especially considering the bugs it has revealed in the past).

I'm not sure whether np=5 is too big or too small... if I were writing the test I might cover the edge cases, e.g. singleton, np=3, and np=total slots.

@dalcinl What do you think?

@hppritcha
Member

Have you checked whether the np=5 run is the slow stage?

@wenduwan
Contributor Author

@hppritcha
Member

I thought the VM where these GitHub Actions run has only 4 "CPUs". Maybe adding -bind-to NONE to the mpiexec command line might help?

@jsquyres
Member

[Screenshot, 2024-05-31 5:03:10 PM]

@jsquyres
Member

The goal with np=5 might be to intentionally stress the oversubscribed scenario -- and that's probably a good thing.

But do we need to do that with the relocated OMPI? I don't think so. At least for the relocated OMPI, it'll either work or it won't -- I think testing with either singleton or np=1 will tell us if it works, and that's good enough. Right?

For the other runs -- I think running singleton, np=5, and np=something_else would probably be good.

Do we need all of np=1, np=2, np=3, np=4? I.e., does running all of the np values tell us something that not running all of them wouldn't tell us?

@hppritcha
Member

I don't think we need np=5 for the relocated binary; I agree with @jsquyres about that one.
Actually, it turned out np=3 uncovered some issues with one of the collectives on inter-communicators, or something like that, because it had asymmetric local/remote groups. So let's keep singleton..5 for base testing and Jeff's suggestions for relocated-package testing.
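
To illustrate the asymmetric shape: a hedged sketch where three parent ranks collectively spawn two children, so the inter-communicator's local and remote groups differ in size (the launch command and file name are assumptions for illustration):

```python
import sys
from mpi4py import MPI

parent = MPI.Comm.Get_parent()
if parent == MPI.COMM_NULL:
    # Launched as e.g. `mpiexec -np 3 python spawn_asym.py`:
    # all three parent ranks collectively spawn two children.
    inter = MPI.COMM_WORLD.Spawn(sys.executable, args=[__file__], maxprocs=2)
else:
    inter = parent

# On the parent side the local group has 3 processes and the remote
# group has 2; this asymmetry is what stressed the collectives.
print(inter.Get_size(), inter.Get_remote_size())
inter.Barrier()
inter.Disconnect()
```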

The previous inter-communicator race condition has been fixed. Enable the
workflow by default to catch new regressions.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
@wenduwan wenduwan force-pushed the enable_mpi4py_spawn branch from dc3f425 to f5d769b Compare May 31, 2024 22:09
@wenduwan
Contributor Author

Ack. Removed the np=5 relocated-OMPI test.

@wenduwan wenduwan merged commit 71c2882 into open-mpi:main May 31, 2024
@dalcinl
Contributor

dalcinl commented Jun 1, 2024

@wenduwan I still have to make changes on mpi4py's side so the spawn tests are not skipped under ompi@main. I'm doing final testing here, and if that goes well, I'll merge mpi4py/mpi4py@54c0cf3.

Please keep me posted once all these spawn fixes land in v5.0.x; we will have to update things again in both mpi4py and ompi.
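
For reference, a hedged sketch of the kind of vendor/version gate that decides whether spawn tests run or get skipped; the cutoff version and test body are illustrative assumptions, not mpi4py's actual gating logic:

```python
import sys
import unittest
from mpi4py import MPI

# MPI.get_vendor() reports e.g. ('Open MPI', (5, 0, 0)); the cutoff
# below is a made-up placeholder, not mpi4py's real gating rule.
name, version = MPI.get_vendor()
SPAWN_OK = not (name == 'Open MPI' and version < (5, 1, 0))

@unittest.skipUnless(SPAWN_OK, 'spawn is unreliable on this Open MPI')
class TestSpawn(unittest.TestCase):
    def test_spawn_barrier(self):
        child = ('from mpi4py import MPI; p = MPI.Comm.Get_parent(); '
                 'p.Barrier(); p.Disconnect()')
        inter = MPI.COMM_SELF.Spawn(sys.executable, args=['-c', child],
                                    maxprocs=1)
        inter.Barrier()
        inter.Disconnect()

if __name__ == '__main__':
    unittest.main()
```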
